An Efficient Algorithm for De-duplication of Demographic Data

Authors

  • Vandana Dixit Kaushik
  • Amit Bendale
  • Aditya Nigam
  • Phalguni Gupta
Abstract

This paper proposes an efficient algorithm for de-duplicating records based on demographic information consisting of two name strings, GivenName and Surname, for each individual. The algorithm consists of two stages: enrolment and de-duplication. In both stages, all name strings are reduced to generic name strings with the help of phonetic-based reduction rules. Several name strings may therefore share the same generic name, and many individuals may share the same name. A generic name, together with all of its name strings and their Ids, forms a bin. At the enrolment stage, a database of demographic information is efficiently created as an array of bins, where each bin is a singly linked list. At the de-duplication stage, the query name strings are reduced and all neighbouring bins of the reduced name strings are searched to determine the top k best matches. To evaluate the performance of the proposed algorithm, a large demographic database of 485,136 individuals was considered. The phonetic reduction rules were observed to reduce both name strings by more than 90%. Experimental results show a very high hit rate at a low penetration rate.
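
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the phonetic reduction rules, say how neighbouring bins are defined, or state how matches are ranked, so the rules in `reduce_name`, the edit-distance neighbourhood (`radius`), and the edit-distance ranking are all assumptions made for illustration. The names `Node`, `NameIndex`, `enrol`, and `top_k` are hypothetical, and a dictionary stands in for the array of bins.

```python
from dataclasses import dataclass
from typing import Optional


def reduce_name(name: str) -> str:
    """Reduce a name to a generic key using ASSUMED phonetic rules
    (the paper's actual rules are not given in the abstract)."""
    s = "".join(ch for ch in name.lower() if ch.isalpha())
    for old, new in (("ph", "f"), ("sh", "s"), ("w", "v"), ("z", "j")):
        s = s.replace(old, new)
    # Keep the first letter; drop vowels, 'h' and 'y' from the rest.
    s = s[:1] + "".join(ch for ch in s[1:] if ch not in "aeiouhy")
    out = []
    for ch in s:  # collapse runs of the same letter
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


@dataclass
class Node:
    """One enrolled individual; nodes in a bin form a singly linked list."""
    person_id: int
    given: str
    surname: str
    next: Optional["Node"] = None


class NameIndex:
    def __init__(self) -> None:
        # Generic (given, surname) key -> head node of that bin's list.
        self.bins = {}

    def enrol(self, person_id: int, given: str, surname: str) -> None:
        key = (reduce_name(given), reduce_name(surname))
        # Prepend to the bin's linked list.
        self.bins[key] = Node(person_id, given, surname, self.bins.get(key))

    def top_k(self, given: str, surname: str, k: int = 3, radius: int = 1):
        """Return the Ids of the k closest records in neighbouring bins.
        'Neighbouring' is ASSUMED to mean key edit distance <= radius."""
        qkey = (reduce_name(given), reduce_name(surname))
        scored = []
        for key, node in self.bins.items():
            if edit_distance(key[0], qkey[0]) + edit_distance(key[1], qkey[1]) > radius:
                continue  # bin is not a neighbour; its records are never visited
            while node is not None:
                score = (edit_distance(node.given.lower(), given.lower())
                         + edit_distance(node.surname.lower(), surname.lower()))
                scored.append((score, node.person_id))
                node = node.next
        return [pid for _, pid in sorted(scored)[:k]]
```

For example, after enrolling `(1, "Phalguni", "Gupta")` and `(2, "Falguni", "Guptha")`, both records fall into the same bin, and a query for "Phalguni Gupta" walks only that bin's list. The sketch compares the query key against every bin key, which is simple but linear in the number of bins; it only illustrates how restricting the search to a few bins keeps the number of records examined small.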

Similar resources

Accurate Fruits Fault Detection in Agricultural Goods using an Efficient Algorithm

The main purpose of this paper was to introduce an efficient algorithm for fault identification in fruit images. First, the input image was de-noised using a combination of Block Matching and 3D filtering (BM3D) and a Principal Component Analysis (PCA) model. Afterward, in order to reduce the size of the images and increase the execution speed, a refined Discrete Cosine Transform (DCT) algorithm was uti...

An Efficient Meta Heuristic Algorithm to Solve Economic Load Dispatch Problems

The Economic Load Dispatch (ELD) problems in power generation systems are to reduce the fuel cost by reducing the total cost for the generation of electric power. This paper presents an efficient Modified Firefly Algorithm (MFA), for solving ELD Problem. The main objective of the problems is to minimize the total fuel cost of the generating units having quadratic cost functions subjected to lim...

Double Cervix with Normal Uterus and Vagina - An Unclassified Müllerian Anomaly

Müllerian anomalies are very common, and a frequent cause of infertility. The most used classification system until now, proposed by the American Society for Reproductive Medicine in 1988, categorizes comprehensively uterine anomalies but fails to classify defects of the cervix or vagina. This is based on a developmental theory that postulates that müllerian duct fusion is unidirectional, begin...

An Efficient Genetic Algorithm to Solve the Intermodal Terminal Location problem

The exponential growth of the flow of goods and passengers, fragility of certain products and the need for the optimization of transport costs impose on carriers to use more and more multimodal transport. In addition, the need for intermodal transport policy has been strongly driven by environmental concerns and to benefit from the combination of different modes of transport to cope with the in...

Efficient Data Mining with Evolutionary Algorithms for Cloud Computing Application

With the rapid development of the internet, the amount of information and data being produced is extremely large. Hence, clients are confused by the huge amount of data, and it is difficult to tell which data are useful. Data mining can overcome this problem. When data mining is used on cloud computing, it reduces processing time, energy usage and costs. As the speed of ...

Publication date: 2012